Identification and clinical validation of key genes as the potential biomarkers in colorectal adenoma

Background Colorectal cancer (CRC), ranking third in cancer prevalence and second in mortality worldwide, is mainly derived from colorectal adenoma (CRA). CRA is a common benign disease in the intestine with rapidly increasing incidence and malignant potential. Therefore, this study aimed to recognize significant biomarkers and original pathogenesis in CRA. Methods Transcriptome data of GSE8671, GSE37364, and GSE15960 were downloaded from the Gene Expression Omnibus (GEO) datasets, and differentially expressed genes (DEGs) were screened. Functional pathways enrichment, protein–protein interaction (PPI) network, stem-correlation analysis, CIBERSORT, risk score and survival analyses were performed. RT-qPCR and immunohistochemical staining were applied to verify our results. Results Screening for significant DEGs in each dataset, we identified 230 robust DEGs, including 127 upregulated and 103 downregulated genes. Functional pathways enrichment showed that these DEGs were distinctly enriched in various tumor-associated pathways, such as growth factor activity, extracellular structure organization, neutrophil activation, and inflammatory response. We filtered out two hub genes via STRING and Modules analysis, including CA2 and HSD11B2. Stem-correlation analysis displayed that hub genes were negatively associated with stem-related genes (Olfm4, CD44, CCND1 and MYC). The CIBERSORT algorithm indicated that Macrophage2, activated mast cells, and Neutrophils promoted CRA progression through inflammation. Survival analysis showed that CA2 and HSD11B2 were positively associated with survival outcomes in CRC. Conclusion Our study has successfully identified the critical role of two core genes in the development and oncogenesis of CRA, which provides novel insight into the underlying pathogenesis, potential biomarkers and therapeutic targets. Supplementary Information The online version contains supplementary material available at 10.1186/s12885-022-10422-9.


Introduction
Colorectal cancer (CRC) is a malignant tumor in the intestine with a greatly high incidence, ranking second in male malignant tumors and third in females [1]. In the United States, approximatedly 150,000 new cases of CRC occurred in 2021, with about 53,000 deaths [2]. As developing countries progress, the incidence of colon cancer is gradually increasing globally, and new cases are expected to increase to 2.5 million by 2035 [3]. With changes in diet and lifestyle, the incidence of CRC in China has also increased rapidly, accounting for over 40% of global morbidity in 2020 [4]. The latest studies report that CRC incidence in people under 50 has increased significantly [5], and CRC treatment has greatly burdened patients and society. CRC is mainly sporadic, and about 85% of colorectal cancers originate from colorectal adenoma (CRA), of which 80% of adenomas have APC mutations, on which they accumulate multiple gene mutations (KRAS, p53, and SMAD4) and gradually evolve into cancer [6,7]. The sequence of colorectal adenoma-cancer evolution usually takes 5-15 years [8], creating an optimal window period for the clinical prevention and treatment of CRA and CRC. Therefore, prevention of CRA and timely and effective blocking of adenoma-cancer sequences are crucial to reducing CRC incidence.
CRA is a benign lesion derived from the glandular epithelium of the colorectum. Pathological classification is divided into three categories, including tubular adenoma, villous adenoma, and villous tubular adenoma. Besides, adenomas can be classified into low-risk and high-risk adenomas based on their malignant potential. High-risk adenomas include histopathologic diagnosis of villous or tubular villous adenomas, ≥ 10 mm in diameter, with or without dysplasia, which is considered as a precancerous lesion of CRC. Epidemiological studies have shown that about 10% of adenomatous polyps will develop colorectal cancer, and about 25% of high-grade adenomas will develop colorectal cancer [9]. CRA incidence is obviously increasing due to different risk factors, such as genetics, age, BMI, alcohol and exercise [10,11]. Colonoscopy screening and endoscopic adenoma eradication are currently the most effective methods for detecting and treating adenomas, but adenomas are prone to recurrence after resection. Domestic studies have shown that the recurrence rate of CRA can reach 61.09% within two year, and the recurrence rate gradually increases with time [12]. High-risk factors for postoperative recurrence of adenoma include fragment resection, intraoperative bleeding, high-grade adenoma, and lesion size ≥ 40 mm [13]. In addition, after adenoma treatment, according to its specific pathological classification, clinical guidelines recommend repeat colonoscopy every 1-3 years. Lengthy bowel preparations, unbearable pain associated with colonoscopy, intraoperative perforation, and surgical risks of postoperative wound bleeding and infection significantly reduce the patients' enthusiasm for examination and treatment. Therefore, finding essential genes, exploring their potential pathogenesis of colorectal adenoma, and developing gene-targeted drugs are urgent clinical and scientific problems to be solved.
Herein, we systematically analyzed transcriptomic characteristics of adenoma based on GEO datasets. DEGs (differential expression genes) among colorectal mucosa and adenoma were analyzed and obtained, followed by GO/KEGG, GSEA, and protein-protein interaction (PPI) network analysis. We also exhibited the immune landscape of CRA to the mucosa. Finally, we identified two core genes, CA2 and HSD11B2, confirming their expression patterns through real-time qPCR and immumohistochemical staining, and verifying the survival rates of each essential gene. In conclusion, our study will improve the understanding of the pathogenesis in CRA, and the core genes may serve as original biomarkers and therapeutic targets for colorectal adenoma and cancer.

Collection of data from GEO datasets
The RNA expression profiles of CRAs were obtained from GEO datasets (http:// www. ncbi. nlm. nih. gov/ geo/), including GSE8671 [14], GSE15960 [15], and GSE37364 [16], based on the GPL570 platform. Specifically, our study contained 140 samples, including 65 CRAs and 75 mucosal samples (Table 1). Their RNA-sequencing data were processed for identifying DEGs. Datasets that met the following criteria were eligible: 1. CRA and mucosa; 2. datasets contained the transcriptome data from CRA and mucosa; 3. more than 5 pairs of samples among each dataset.
The quantile normalization method normalizes gene expression intensities.

GO and KEGG pathway analysis
GO analysis explains the biological function of specific genes from three parts, involed in a cellular component, molecular function, and biological process. KEGG [17,18], analyzing the role of genes, was employed to find which biological pathways specific genes were enriched in CRAs. Performing GO and KEGG pathway analysis based on DEGs between adenoma and mucosa from GEO datasets via the R package (clusterProfiler, 3.14.3 version).

Gene Set Enrichment Analysis (GSEA)
RNA-sequence expression of mucosa and adenoma was analyzed by GSEA (http:// softw are. broad insti tute. org/ gsea/ version 4.0); clusterProfiler, 3.14.3 version.) referring to h.all.v7.2.symbols.gmt [Hallmarks] and the obtained gene sets were compared with known diseaseassociated gene sets to characterize CRAs. GSEA analysis was based on the normalized data among GSE8671, GSE15960 and GSE37364.

Establishment of PPI network
We applied the STRING (https:// string-db. org) online database to build the PPI network based on the common DEGs. After obtaining the primary PPI network, we performed Cytoscape (version 3.8.2) software to visualize and analyze the gene interaction network. Use Cytoscape software's MCODE plug-in to identify the basic modules of the entire network, including upregulated and downregulated genes.

Tissues
Ten pairs of fresh adenoma and adjacent mucosa tissues were collected from the Endoscopy Center of Nanjing Madical University Jiangsu province hospital. Besides, pathological sections of 10 cases of paraffin-embedded tissues (mucosa and adenoma) were obtained from the Department of digestive endoscopy, Nanjing Madical University Jiangsu province hospital too. Ethical approval was obtained for this study from the Ethical Committee of Medical Research, Jiangsu province Hospital of Nanjing Medical University (2018-SR-258).

Validation of mRNA and protein expressions of hub genes
To confirm the expression of CA2 and HSD11B2, we performed RT-qPCR, the immunohistochemical staining (IHC) for CRAs and normal tissues. We utilized the RNeasy Protect Mini kit (Tiangen) to extract total RNA of mucosal tissues and CRAs. The reverse transcription reagent purchased from Promega was applied to reverse the transcription of RNA. All experimental procedures followed the instructions of the kit. SYBR Green Master mix (Vazyme) was used for polymerase chain reaction (PCR). The workflow includes initial denaturing (15 min at 95 °C), 40 cycles of 95 °C for 30 s and 60 °C for 1 min in the 7500 fast real-time PCR system (BioRad). Primers for RT-qPCR were recited in Supplementary Table S1. 5 μm slides of CRA and normal mucosa were orderly dewaxed in xylene baths for 3 times, then rehydrated with graded alcohol series and retrieved in a pressure cooker with sodium citrate buffer (pH 6.0) heating for 10 min. The CA2 (GeneTex, GTX105562) and HSD11B2 (Proteintech, 14,192-1-AP) antibodies were utilized for IHC according to their protocols. The IHC steps followed our previous article [19]. We took two field shots on each slide. The IHC staining scores (IS) were classified into four score ranks: 0, negative; 1, weak; 2, moderate; and 3, strong. The percentage of positively stained cells (PS): 0 (< 5%), 1 (5-25%), 2 (25-50%), 3 (50-75%) and 4 (75-100%).

Analysis of ROC and AUC
MedCalc software was utilized for ROC analysis of CA2 and HSD11B2 based on RT-qPCR results and GEO datasets (GSE71187 and GSE41657). AUC and ROC were applied to assess the predictive value of the hub genes for CRA and CRC.

Relationship between gene expression and immune cell infiltration
The CIBERSORT (Cell-type Identification by Estimating Relative Subsets Of RNA Transcripts) was utilized to research the association between colorectal tissues ( mucosa and adenoma) and 22 immune cells. At the same time, we analyzed the association between the expression of core genes (CA2 and HSD11B2) and tumor-infiltrating immune cells via R.

GEPIA database analysis
The GEPIA database (http:// gepia2. cancer-pku. cn/# index) was used to explore the relationship between the gene expression and prognosis in different tumors based on TCGA datasets. In this study, we applied GEPIA to analyze the relationship between expression and prognosis of CA2 and HSD11B2 according to 272 cases of CRC. Besides, the correlation between targeted gene expression and tumor stage was also determined by the GEPIA database. The univariate Cox regression analysis was performed to establish the risk score of hub genes-related prognostic signature based on the TCGA cohort. For the survival analysis, the patients were split medially (50% high-expression and 50% low-expression), and adding the 95% CI as a dotted line.

Statistical analysis
All the experimental analysis results were shown as means ± the standard deviation (SD). The differences between various groups were analyzed by Graphpad Prism 8. The t-test analysis of variance was utilized to evaluate the differences between the two groups. P-value < 0.05 was considered to be statistical significance. All bioinformatic analysis was performed via R (V 4.1.1) software.

Identification of DEGs in CRA
The development of almost CRC follows the mucosa-adenoma-cancer sequence. In order to investigate the significant biological functions of critical DEGs in the evolution of CRA (Fig. 1A), we have performed bioinformatics analysis (GO, KEGG, PPI, and CIBERSORT) in depth ( Fig. 1B). First, we selected and downloaded three databases on CRA from GEO, including GSE8671, GSE15960, and GAE37364. 75 CRAs and 65 mucosal tissues were enrolled in this study ( Table 1). The three GEO datasets were normalized, and the results are shown in Fig. 1C-E. The volcano plot of each data set was constructed ( Fig. 2A), indicating the difference in molecular expression profiles between CRA and mucosa. The DEGs also were displayed by heatmaps in Fig. 2B. There were 2252 DEGs in GSE8671, including 918 upregulated and 1334 downregulated genes; 2992 DEGs in GSE15960, including 1487 upregulated and 1505 downregulated genes; 1598 DEGs in GSE37364, including 777 upregulated and 821downregulated genes. Through the Venn diagram, we identified 127 upregulated genes and 103 downregulated genes in common among the three datasets (Fig. 2C).

Functional enrichment analysis based on the identical DEGs
Firstly, GO and KEGG pathway enrichment analyses were carried out to determine the molecular function of the common DEGs (Tables 2, 3 and 4). In up-regulated DEGs, KEGG analysis indicated that the Wnt signaling pathway played a role in CRA formation, and GO analysis revealed the main terms included growth factor activity, extracellular structure organization, neutrophil activation, and inflammatory response ( Fig. 3A and B). KEGG analysis showed nitrogen metabolism and tryptophan metabolism enriched in the down-regulated DEGs (Fig. 3C). Similarly, GO analysis displayed that the main terms included carbonate dehydratase activity and bicarbonate transport (Fig. 3D). Nextly, GSEA was applied to the predicted biological function of three datasets. Venn plots displayed that the same functional pathways were enriched among the three databases (Fig. 4A). The analysis results showed that DNA repair, E2F, MYC, mTORC1, glycolysis, and mitotic spindle were dramatically enriched in CRAs (Fig. 4B-D).

Constructing PPI network and module analysis
The STRING database was applied to create PPI networks to further explore the interactions between these DEGs. The top 3 modules were recognized from the upregulated PPI network by employing MCODE (Fig. 5A). The cluster 1 module included 6 nodes and 15 edges, the cluster 2 module included 5 nodes (Fig. 5B-C). The top 3 modules were identified from the upregulated PPI network (Fig. 6A). The cluster 1 module included 6 nodes and 14 edges; clusters 2 and 3 had 3 nodes and 3 edges, respectively ( Fig. 6B-D).
As we know, the progression of CRC follows the mucosa-adenoma-adenocarcinoma sequence. Whether the genes are upregulated or downregulated in adenomas could persist in adenocarcinoma. We analyzed GEO datasets containing normal mucosa, adenoma, and adenocarcinoma (GSE37364 and GSE71187). The results showed that criticalgenes (REG1A and TIMP1) continued to increase. However, HSD11B2 and CA2 were reduced gradually ( Supplementary Fig. 1).

Exploring the correction of hub genes and stem-related genes
APC mutation in the intestine played an essential role in the progression of CRA [21]. Wnt/β-catenin is closely related to stemness and promotes tumor proliferation via regulation of stem genes [22]. KEGG analysis of up-regulated genes showed Wnt/β-catenin had already enriched in CRA. The downstream genes of β-catenin include Lgr5, MYC, and CCND1. So, in GEO datasets, we verified the association of hub genes and stemness-related genes (Lgr5, MYC, CCND1, CD44, Olfm4, and ALCAM). TIMP1 was positively associated with the stemnessrelated genes (p-value < 0.05), while CA2 and HSD11B2 had opposite trends (p-value < 0.05) (Fig. 7A and B). For example, TIMP1 was positively associated with MYC, while HSD11B2 and CA2 were negative (Fig. 7C-F). Besides, TIMP1 was positively associated with CCND1 and Olfm4, CA2 and HSD11B2 were negatively related with CCND1 and Olfm4 in GSE8761and GSE37364 ( Supplementary Fig. 2). However, REG1A was not related to part of stem genes in GSE8671 and GSE37364.

Validation of differentially expressed levels for hub genes
To verify the difference between these hub genes between adenoma and mucosa, we conducted RT-qPCR and immunohistochemical staining. RT-qPCR displayed that the mRNA expression level of hub genes was higher in the adenoma than in normal mucosa except TIMP1 (Fig. 8A-D). Combined with the above experimental  results, we identified CA2 and HSD11B2 as core genes. Then, the ROC and AUC were applied to predict the diagnostic value of CA2 and HSD11B2 in distinguishing CRA from mucosa and CRC from CRA. The area under the curve (AUC) values of hub genes (CA2 and HSD11B2) and combination were 0.951, 0.864, and 0.951, respectively (Fig. 8E). In GSE41657and GSE71187, the ROC and AUC of hub genes could remarkably distinguish adenoma from mucosa or cancer from adenoma ( Supplementary Fig. 3). During the sequence of mucosaadenoma-carcinoma, the hub genes expression increased gradually and showed great significance. Next, we performed IHC to validate the protein levels of hub genes between mucosa and CRA. Consistent with the trend in the mRNA, the HSD11B2 and CA2 protein levels were reduced significantly (Fig. 9A-B). The area under the curve (AUC) values of CA2 and HSD11B2 and combination were 0.784, 0.674, and 0.831 according to IHC scores, respectively (Fig. 9C).

Exploration of immune landscape of CRA GEO datasets
Tumor-infiltrating immune cells analysis was conducted by CIBERSORT in GSE8671 and GSE37364. Figure 10A displays the percentage of 24 immune cells infiltration in each sample. Infiltrating immune cells showed different infiltration abundance in adenoma or mucosa. We found the same infiltration trend of immune cells when comparing the two datasets (Fig. 10B). Naïve CD4 T cells,  activated/resting memory CD4 T cells, macrophage M0, activated mast cells, and neutrophils were highly infiltrating in adenoma. Besides, Treg cells were enriched in adenoma in GSE8671, while it was not significant in GSE37364. However, CD8 T cells, follicular helper T cells, resting mast cells, and macrophage M2 highly infiltrated the mucosa. In the two GEO datasets, CA2 and HSD11B2 were negatively associated with neutrophils, activated mast cells and macrophage M0, positively with resting mast cells and M2 (Fig. 10C). These hub genes may activate neutrophils, mast cells and macrophage M0 and produce different inflammatory factors to promote the proliferation of adenoma epithelium.

Risk score and survival analysis of CA2 and HSD11B2
Finally, we explored the potential prognostic value of hub genes for CRC patients by risk score and Kaplan-Meier (KM) survival analysis and log-rank test. In COAD, CA2 and HSD11B2 were significantly decreased (Fig. 11A-B). To evaluate a robust risk signature for clinical use, we analyzed the risk score distribution, survival time, and core genes expression based on the TCGA dataset (Fig. 11C). Low-expressed CA2 and HSD11B2 were associated with poor prognosis. GEPIA was utilized to perform survival analysis based on colorectal adenocarcinoma samples from TCGA data. In the analysis of overall survival, high CA2 expression was related to prolonged OS (p-value = 0.024) (Fig. 11D), while HSD11B2 was not associated with OS (Fig. 11F). In the analysis of DFS (Disease-free survival), high HSD11B2 was related to prolonged DFS (p-value = 0.022) (Fig. 11G). Besides, CA2 was not associated with DFS (Fig. 11E). Finally, there was no relationship between hub genes and tumor stages in CRC (Supplementary Fig. 4).

Discussion
Colorectal adenomas are precancerous lesions of colorectal cancer with high malignant potential, so the timely detection and diagnosis of adenomas are of great significance. At present, studies mainly focus on the discovery of CRC-specific biomarkers. However, colorectal adenomas are rarely reported, so finding hub genes that could drive the progression and deterioration of adenomas is more clinically meaningful for predicting high-risk adenomas. This study focused on colorectal mucosa and adenoma gene expression profiles via detailed bioinformatic analysis and uncovered significant regulatory signaling pathways and core genes.
Our study identified 230 robust DEGs by comparing genes expressed in colorectal mucosa and adenoma samples in three GEO datasets (GSE8671, GSE15960 and GSE37364), which included 127 upregulated and 103 downregulated genes. GO enrichment analysis of all genes indicated that growth factor activity, extracellular structure organization, neutrophil activation and inflammatory response were more potent in CRA samples than in mucosa samples. KEGG pathway that was enriched in CRA mainly included Wnt signaling pathway. Nitrogen and tryptophan metabolism was reduced in CRA. GO analysis revealed that growth factor activity, extracellular structure organization, neutrophil activation, and inflammatory response were enriched in CRA. Growth factor and extracellular structure organization can activate the cell proliferation signaling pathway and provide a sustainable 3D growing environment. Tumors can attract neutrophils to the cancer site via pro-inflammatory cytokine secretions and induce a switch to pro-tumoral (or N2) neutrophils, which support the metastatic spread and have an immunosuppressive role [23]. So, neutrophil activation and inflammatory response may contribute to excessive adenoma cell proliferation. GSEA showed that E2F, MYC, mTORC1, glycolysis and mitotic spindle were significantly enriched in CRA. E2F and MYC played a vital role in the proliferation of tumor cells [24,25]. mTORC1 is involved in the metabolic regulation of tumors, and the upregulation of glycolysis is an essential  [26,27]. The oncogenes and metabolic reprogramming may promote the progression of CRA. Next, the hub genes, screened by the PPI network and MCODE, were verified through the GEO datasets, including CA2 and HSD11B2. Combined with their expression level in CRA, CA2 and HSD11B2 were downregulated with statistical significance (p-value < 0.05). The relationship of stemness analysis suggested hub genes were negatively associated with tumor-stem genes. We also conducted tumor-infiltrating immune cells analysis by CIBERSORT in GSE8671 and GSE37364, and found different infiltration abundance in adenoma or mucosa. Hub genes were explored as the potential prognostic value for CRC patients by log-rank test and KM survival analysis. Therefore, according to our present research results, we hypothesized that CA2 and HSD11B2 might serve as biomarkers for the early diagnosis of CRA. Carbonic anhydrase 2 (CA2) belongs to human carbonic anhydrases (CAs), a well-defined group of metal enzymes that catalyze carbon dioxide into bicarbonate [28]. CA2 functions to regulate ion transport and pH balance, which permeates many biological processes. CAs variants have been linked to ulcers, osteoporosis, obesity, and cancer [29]. The immunohistochemistry results in HCC revealed that CA2 expression levels were lower in tumor tissues than in adjacent tissues. The KM analysis demonstrated that DFS and OS were higher in the CA2 high expression group than in the CA2 low expression group (p-value < 0.05) [30]. Low CA2 expression is negatively correlated with cancer size, distant metastasis, pathological stage, and poorer overall survival in gastric cancer [31,32]. The clinicopathological correlation analysis showed that CA2 was significantly downregulated in tumor metastases, such as hepatocellular carcinoma (p-value = 0.026) [33]. Low CA2 expression may promote adenoma cell stemness and serve as a biomarker for high-risk adenomas.
11-hydroxysteroid dehydrogenase (HSD11B2) is a catalytic enzyme that converts cortisol to cortisone and corticosterone to dehydrocorticosterone in vivo. HSD11B2, as a critical enzyme, can convert cortisol to inactive cortisone and accelerate tumor progression and metastasis [34]. Knockout of HSD11B2 promoted tumor angiogenesis (expression of EGFR and VEGFA), cell proliferation, and invasion in oral cancer cells [35]. HSD11B2 expression was significantly reduced in CRC tissues, which upregulated the expression of fibroblast growth factor binding protein Fig. 8 Screening and verification of the hub genes by RT-qPCR. A-D The relative expression of REG1A, CA2 and HSD11B2 between colorectal mucosa and adenoma were measured by RT-qPCR (p-value < 0.01). TIMP1 had no significant difference. E ROC curve with corresponding AUC value for hub genes when classifying CRA from mucosa. *** indicates p-value < 0.001, ** indicates p-value < 0.01, NS, none significance 1 (Fgfbp1) and subsequently increased the phosphorylation of AKT to enhance cell migration and invasion [36]. HSD11B2 down-regulation in adenoma may promote its proliferation by promoting stemness and proinflammation.
Overall, in the study, we systematically explored the differences in molecular expression profiles of colorectal mucosa and adenomas, elucidating enriched pathways, hub genes (CA2 and HSD11B2), disease prognosis and immune patterns. However, our study had several limitations. First, more large clinical samples are needed to verify the expression of CA2 and HSD11B2. Moreover, the molecular functions of these hub genes in CRA remained unclear and needed to be verified. Using shRNA targeting these hub genes will further strengthen the reliability of this study.

Conclusion
In conclusion, using various GEO datasets, we identified various significant DEGs in CRA, and found two hub genes that can be considered as novel and potential biomarkers of CRA. We further used the TCGA databases as a validation dataset to confirm the prognosis among CRC patients. Therefore, our research results present innovative and credible biomarkers for CRA, which will serve as a risk factor for predicting the malignant transformation of adenomas and be helpful for further clinical applications in CRA and CRC diagnosis, targeted therapy, and prognosis.